ABSTRACT

Searching for Darwinian selection in natural popula- 
tions has been the focus of a multitude of studies 
over the last decades. Here we present the 1000 
Genomes Selection Browser 1.0 (http://hsb.upf.edu) 
as a resource for signatures of recent natural selec- 
tion in modern humans. We have implemented and 
applied a large number of neutrality tests as well as 
summary statistics informative for the action of se- 
lection such as Tajima's D, CLR, Fay and Wu's H, Fu 
and Li's F* and D*, XPEHH, ΔiHH, iHS, Fst, ΔDAF and
XPCLR among others to low coverage sequencing 
data from the 1000 genomes project (Phase 1; 
release April 2012). We have implemented a publicly 
available genome-wide browser to communicate the 
results from three different populations of West 
African, Northern European and East Asian ancestry 
(YRI, CEU, CHB). Information is provided in UCSC- 
style format to facilitate the integration with the rich 
UCSC browser tracks and an access page is provided 
with instructions and for convenient visualization. We 
believe that this expandable resource will facilitate 
the interpretation of signals of selection on different 
temporal, geographical and genomic scales. 

INTRODUCTION 

Initiatives such as the 1000 Genomes Project (1,2) are 
generating resequencing data from world-wide human
populations on a genome-wide scale. Resequencing data 
constitutes a major leap for population genomic analysis 
due to its higher information density and limited SNP as- 
certainment bias compared to genotyping data. Therefore 
such data is appropriate to calculate summary statistics 
that are based on the site frequency spectrum like CLR 
or Tajima's D. Using the neutral evolutionary model as a 
null hypothesis, diverse statistics can be applied to genetic
data to identify deviations from neutrality (Table 1). 
These statistical tests show varying degrees of robustness 
to demographic events (e.g. population bottlenecks and 
expansions) and sensitivity to different types of selection 
(e.g. positive, purifying or balancing). For instance, population
bottlenecks can lead to footprints that are similar
to those caused by positive selection (21). Therefore, 
outlier approaches, which are commonly used to identify
non-neutral loci in the extremes of a genome-wide distribution,
are likely to yield a number of false positives.
Likewise, a number of false negatives, i.e. truly selected loci
that escape detection, are expected in a
grey zone near the (arbitrary) outlier threshold (22).
Outlier approaches in genome scans have proven 
powerful, but they should be interpreted carefully
in order to avoid storytelling (23). Moreover, a
profound understanding of adaptive evolution requires 
the integration of biological function (24) and if 
possible, validation on an experimental basis (25). 
Molecular network approaches can also give a functional 
context to the specific genes under adaptive selection 
(26,27). In all studies, care should be taken in 
communicating putative loci under selection to the 
public in order to avoid racist misinterpretation (28).

Table 1. List of available summary statistics

Despite these limitations and the fact that complete
selective sweeps may not be extremely widespread in 
humans (29), a large number of regions under strong 
positive selection can be expected in the genome (30). 

DESCRIPTION OF APPLIED STATISTICAL TESTS 

Due to linkage, neutral alleles in the surrounding region 
hitchhike with the selected allele. Maynard Smith and 
Haigh (31) described this process of genetic hitchhiking 
and the so-called selective sweep. More recent studies 
showed that genetic hitchhiking generates distinct poly- 
morphism signatures on the genome such as: (i) reduction 
of polymorphism level and excess of low- and high- 
frequency derived variants (32), (ii) spatial patterns of 
linkage disequilibrium (33) and (iii) increased genetic differentiation
among populations (34). Taking advantage of
these three theoretical expectations, several methods to 
detect positive selection have been developed in the last 
two decades. Their number reflects the fact that no
single statistic suffices to detect selection under
various demographic models and modes of selection (22).

Here, we implemented a large number of statistical tests 
(Table 1) in order to allow for a more comprehensive 
analysis of natural selection, especially positive selection.
In brief, we have assigned the statistical tests to different
method families (Table 1). Within the first family, which is
based on the allele frequency spectrum, Tajima's D (3) is a
classical neutrality test that compares estimates of 
the number of segregating sites and the mean pair-wise 
difference between sequences. CLR is a multi-locus composite
likelihood ratio test (4,35). Fay and Wu's H (5)
uses another facet of the site-frequency spectrum, by
comparing the number of derived segregating sites at
high frequencies to the number of variants at intermediate 
frequencies. Fu and Li's F* compares the number of 
singletons to the mean pair-wise difference between 
sequences and Fu and Li's D* compares it to the total 
number of nucleotide variants in a genomic region (6). 
R2 (7) is a statistical test for detecting population 
growth based on the comparison of the difference 
between the number of singletons per sequence and the 
average number of nucleotide differences. 
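
To make the frequency-spectrum comparison concrete, the following is a minimal sketch of Tajima's D using the standard published formula. It is not the implementation used for the browser (that is described in the Supplementary Material); the function name `tajimas_d` and the input format (a list of equal-length haplotype strings for one genomic window) are assumptions made for illustration.

```python
import math
from itertools import combinations


def tajimas_d(haplotypes):
    """Tajima's D for equal-length haplotype strings (hypothetical input format).

    Compares two estimators of diversity: the mean pairwise difference (pi)
    and the number of segregating sites scaled by a harmonic number (S / a1).
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    # S: sites at which more than one allele is observed in the sample
    seg = sum(1 for j in range(n_sites) if len({h[j] for h in haplotypes}) > 1)
    # The variance term is zero for n < 4, and D is undefined without variation
    if n < 4 or seg == 0:
        return float("nan")

    # pi: average number of differences over all pairs of sequences
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - seg / a1) / math.sqrt(e1 * seg + e2 * seg * (seg - 1))


# An excess of rare variants (as after a sweep or expansion) gives D < 0,
# an excess of intermediate-frequency variants gives D > 0.
rare = ["100", "010", "001", "000", "000", "000"]
intermediate = ["111", "111", "111", "000", "000", "000"]
print(tajimas_d(rare), tajimas_d(intermediate))
```

In this toy sample the singleton-only window yields a negative value and the intermediate-frequency window a positive one, which is the contrast the test exploits.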

Among the linkage disequilibrium structure methods,
XP-EHH (8) is a cross-population test based on 
extended haplotype homozygosity (EHH). ΔiHH considers
the difference between the integrated haplotype
homozygosity scores for each allele in a single population 
while iHS (9) is defined as their log ratio. EHH average
and EHH maximum (36; modified from (10)) are based on
the extended haplotype homozygosity. Wall's B (11)
counts the number of pairs of adjacent segregating sites 
that are congruent (if the subset of the data consisting of 
the two sites contains only two different haplotypes), while 
Wall's Q (12) adds the number of partitions (two disjoint 
subsets whose union is the set of individuals in the sample) 
induced by congruent pairs to Wall's B. Fu's F (13) takes 
into account the haplotype diversity in the sample. 
Dh (14) is a summary statistic based on the number of 
different haplotypes in the sample. 
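
The relationship between EHH, the integrated haplotype homozygosity (iHH), iHS and ΔiHH can be sketched as follows. This is an illustrative simplification, not the code behind the browser tracks: the function names, the 0/1 coding of ancestral and derived alleles, the EHH cutoff of 0.05, the use of an absolute difference for ΔiHH and the sign convention ln(iHH_ancestral / iHH_derived) are all assumptions, and the genome-wide standardization of iHS by allele frequency bin is omitted.

```python
import math
from collections import Counter


def ehh(haps, core, site):
    """EHH between the core SNP index and another SNP index."""
    n = len(haps)
    if n < 2:
        return 0.0
    lo, hi = min(core, site), max(core, site)
    # Haplotypes are grouped by their sequence over the interval core..site
    counts = Counter(h[lo:hi + 1] for h in haps)
    identical_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return identical_pairs / (n * (n - 1) // 2)


def ihh(haps, positions, core, cutoff=0.05):
    """Area under the EHH curve in both directions (trapezoid rule).

    Integration stops once EHH drops below the (assumed) cutoff.
    """
    if len(haps) < 2:
        return 0.0
    total = 0.0
    for step in (-1, 1):
        prev_pos, prev_ehh = positions[core], 1.0
        i = core + step
        while 0 <= i < len(positions):
            cur = ehh(haps, core, i)
            total += 0.5 * (prev_ehh + cur) * abs(positions[i] - prev_pos)
            if cur < cutoff:
                break
            prev_pos, prev_ehh = positions[i], cur
            i += step
    return total


def ihs_and_delta_ihh(haps, positions, core):
    """Unstandardized iHS and ΔiHH at one core SNP ('0' ancestral, '1' derived)."""
    ancestral = [h for h in haps if h[core] == "0"]
    derived = [h for h in haps if h[core] == "1"]
    ihh_a = ihh(ancestral, positions, core)
    ihh_d = ihh(derived, positions, core)
    delta = abs(ihh_a - ihh_d)
    if ihh_a > 0 and ihh_d > 0:
        return math.log(ihh_a / ihh_d), delta
    return float("nan"), delta


# Toy example: derived haplotypes are identical, ancestral ones are diverse
haps = ["11111", "11111", "11111", "00000", "01001", "10010"]
positions = [0, 100, 200, 300, 400]
print(ihs_and_delta_ihh(haps, positions, core=2))
```

Under the sign convention assumed here, long shared haplotypes on the derived allele give a negative unstandardized iHS and a large ΔiHH.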

The third family of methods is based on population differentiation.
Fst (37; calculated following the diploid
method in Weir 1996, p. 178) and ΔDAF (18) are estimates
of population differentiation based on derived allele
frequencies. XP-CLR (19) is a multi-locus allele- 
frequency-differentiation statistic between two populations. 
Additional statistics, such as the number of segregating sites per 30-kb window
and the nucleotide diversity (Table 1), are listed
as descriptive statistics. A thorough description of the tests
is given in the original literature (see Table 1) and in diverse 
excellent reviews on the topic (38,39). 
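
Two of the simpler quantities mentioned above, ΔDAF and the number of segregating sites per 30-kb window, can be illustrated with a short sketch. The function names are hypothetical, averaging over several comparison populations is an assumption, and the non-overlapping, zero-based windows used here are an assumption about the windowing scheme rather than a description of the browser's implementation.

```python
from collections import Counter


def delta_daf(daf_focal, daf_others):
    """Derived allele frequency in the focal population minus the mean
    derived allele frequency in the comparison population(s)."""
    return daf_focal - sum(daf_others) / len(daf_others)


def segregating_sites_per_window(snp_positions, window=30_000):
    """Count SNPs in non-overlapping windows; returns {window_start: count}."""
    counts = Counter((pos // window) * window for pos in snp_positions)
    return dict(sorted(counts.items()))


# A derived allele common in one population and rare in the other two
print(delta_daf(0.9, [0.1, 0.2]))
# SNP positions (bp) binned into 30-kb windows
print(segregating_sites_per_window([10, 29_999, 30_000, 95_000]))
```

Values of ΔDAF close to 1 indicate a derived allele that is nearly fixed in the focal population but rare elsewhere.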



COMPUTATIONAL FRAMEWORK AND 
DESCRIPTION OF 1000 GENOMES SOURCE DATA 

A framework to calculate diverse summary statistics 
(Table 1) from 1000 genomes data was developed 
(Figure 1). A detailed description of how the statistics 
were implemented is given (Supplementary Material). 
A genome-wide overview of the results stored in the 
database for selected summary statistics is given 
(Supplementary Table S1). As described in the 1000
genomes Phase 1 paper (1), the quality of the 1000 
genomes low coverage data has improved considerably 
over the pilot phase (2), but a number of limitations 
need to be kept in mind for population genomic 
analysis: (i) singletons and other rare variants are still
underrepresented, (ii) only ~94% of the genome is accessible
with the short-read sequencing technologies used
and (iii) the reported phasing switch error every 250 kb
(median, Supplementary Figure S5 in (1)) likely leads to an
underestimation of the length of long shared haplotypes expected to
occur around recent selective sweeps. Despite these
drawbacks which are mainly due to the nature of the 
low coverage approach, the short-read technology and dif- 
ferences in read depth (40), this dataset has important 
advantages over genotyping data, most importantly (i) a 
higher SNP density, (ii) the overcoming of ascertainment 
bias and (iii) a larger number of individuals per popula- 
tion, when compared to previous datasets (HapMap II 
and HGDP). We used phased data from the CEU, the 
CHB and the YRI populations from the integrated 
Phase 1 variant set (April 2012), with 97, 85 and 88 indi- 
viduals, respectively. From the input vcf (variant call 
format) file we extracted exclusively the low-coverage 
VQSR SNP calls in order to avoid any bias that might
result from differences between low-coverage calls and 
high-coverage exome SNP calls. Indels were not used. 
Ancestral states in this data set were identified using a 
4-way alignment of human, chimpanzee, orangutan and
rhesus macaque, provided by the 1000 Genomes Consortium
(ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/).
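
The filtering step described above (low-coverage SNP calls only, no indels) could look roughly like the sketch below. It is an assumption-laden illustration, not the pipeline used here: the function names are hypothetical, and the sketch assumes that the origin of each call is recorded in an INFO key named `SNPSOURCE` with the value `LOWCOV`, which should be checked against the header of the VCF file actually used. Calls annotated with both sources are kept in this sketch, which may differ from the original procedure.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'AA=A;VT=SNP' into a dict."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def keep_record(line):
    """True for biallelic SNP records flagged as low-coverage calls.

    ASSUMPTION: the INFO key 'SNPSOURCE' and the value 'LOWCOV' mark
    low-coverage calls; verify this against the VCF header.
    """
    if line.startswith("#"):
        return False  # header and meta-information lines
    fields = line.rstrip("\n").split("\t")
    ref, alt, info = fields[3], fields[4], parse_info(fields[7])
    # Single-base REF and ALT excludes indels and multi-allelic records
    is_snp = len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT"
    from_lowcov = "LOWCOV" in info.get("SNPSOURCE", "").split(",")
    return is_snp and from_lowcov


records = [
    "##fileformat=VCFv4.1",
    "1\t100\trs1\tA\tG\t100\tPASS\tAA=A;SNPSOURCE=LOWCOV;VT=SNP\tGT\t0|1",
    "1\t200\t.\tAT\tA\t100\tPASS\tVT=INDEL\tGT\t0|1",
    "1\t300\trs3\tC\tT\t100\tPASS\tSNPSOURCE=EXOME;VT=SNP\tGT\t1|1",
]
print([keep_record(r) for r in records])
```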



AVAILABILITY OF DATABASE 

All data is available via our entry page: http://hsb.upf.edu. 
A search mask gives the user easy access to the results for 
a specific gene or a genomic region of choice. The 'submit' 
button leads the user to a UCSC-style genome browser 
(http://pgb.ibe.upf.edu/), which is a custom installation
of the UCSC Genome Browser (41,42). This UCSC 
Genome Browser installation allows for a visual inspection
of the data, and for an integration of our data with
many other available datasets.

Figure 1. Schematic workflow developed in order to calculate diverse
genome-wide summary statistics informative for the action of selection
and to build a database in order to share and visualize the results.

The raw scores of the tracks can be conveniently downloaded using the UCSC
Table function (43), which is integrated with the Galaxy
platform (galaxyproject.org). Using the 'configure'
function on the browser page, the tracks can be further 
customized and using 'right click' the visualized genomic 
regions can be downloaded as a picture in .png format. 
For every statistical test, we provide two tracks, one for 
the raw scores and one for ranked scores. The purpose of 
the rank score tracks is to provide a comparison to the rest 
of the genome. Conveniently, the rank scores are pre- 
sented in such a way that they present a peak (instead of 
a valley) in regions under positive selection. They are 
calculated using an outlier approach (22,44) by sorting
all the scores genome-wide and determining the −log10
of the rank divided by the number of values in the distribution,
taking the upper tail for most of the tests, or the
lower tail for Tajima's D, Fay and Wu's H, Fu and Li's F*
and D*, R2 and Fu's F (see Table 1 and a more detailed
description on the entry page). The main purpose of the 
entry page is to provide a channel of communication with 
users, following the guidelines in (45). It serves as a 
platform for updates, questions and feedback (46). 
Therefore the page also provides documentation on the 
tracks and on the tests implemented as well as a FAQ 
and a feedback section. 
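
The ranked score described above, −log10 of the rank divided by the number of values, can be sketched in a few lines. The function name `rank_scores` and the `tail` argument are hypothetical, and ties are broken by input order here, which may differ from the procedure documented on the entry page.

```python
import math


def rank_scores(scores, tail="upper"):
    """Empirical rank score: -log10(rank / N) for each raw score.

    tail="upper": the largest raw score gets rank 1 (most tests).
    tail="lower": the smallest raw score gets rank 1 (e.g. Tajima's D).
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=(tail == "upper"))
    ranked = [0.0] * n
    for rank, i in enumerate(order, start=1):
        ranked[i] = -math.log10(rank / n)
    return ranked


raw = [0.1, 5.0, 2.0, -3.0]
print(rank_scores(raw, tail="upper"))  # peak at the largest raw score
print(rank_scores(raw, tail="lower"))  # peak at the smallest raw score
```

With either tail, the most extreme value in a distribution of N scores receives the score log10(N), so candidate regions always appear as peaks.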

Figure 2. Examples of genomic regions under selection in the 1000 genomes selection browser. Tracks of statistics from different populations are
visualized in colour (CEU in green, CHB in red and YRI in blue). Additional examples are given at http://hsb.upf.edu. (A) The p- and q-arms of
chromosome 2 (−log10 of empirically ranked scores). Recurrent peaks at around 72.5 Mb (left green arrow) and 109.5 Mb (right green arrow)
indicate the loci CYP26B1/EXOC6B and EDAR, respectively. (B) Signature of positive selection around SLC45A2, another established skin colour
gene, in the CEU population (0.5-Mb window; −log10 of empirically ranked scores). (C) Widespread balancing selection in the HLA region indicated
by strongly positive scores for Tajima's D in all three analysed human populations (0.5-Mb window).

EXAMPLE APPLICATIONS

First, we exemplify the use of the database by extracting
results for a number of established loci under selection:
EDAR (47), LCT (46), SLC45A2 (48), CD36 (49), HERC2
(50), SLC24A5 (51), CD5 (52) and APOL1 (53). A locus-specific
summary of statistical tests is given
(Supplementary Table S2). Interestingly, for any given
locus, only a subset of statistical tests shows an extreme 
outlier score. This is consistent with differences in the
architecture of selective sweeps. iHS scores near
certain very pronounced selective sweeps (e.g. LCT and 
SLC24A5) failed to compute due to inherent properties 
of the statistics, because either (i) the selected haplotype 
was near fixation or (ii) the EHH did not drop below the 
defined threshold in a given window. Examples for both 
positive (SLC45A2) and balancing (HLA region) selection
are visualized in Figure 2. As expected, Tajima's D scores 
around HLA (54) as well as the ABO locus (55) (data not 
shown) were pronouncedly elevated in all three analyzed 
populations, a pattern which is compatible with the action 
of balancing selection. 

COMPARISON TO OTHER WEB RESOURCES 

As for positive selection based on between-species com- 
parisons, the Selectome database (http://bioinfo.unil.ch/ 
selectome/; (56)) presents results based on the dN/dS 
method using a branch-site specific likelihood test. As
for recent natural selection within modern humans, a 
number of web resources are available. For previous 
datasets, e.g. the HapMap 2 and HGDP projects, 
several positive selection statistics are available in the form
of the Haplotter tool (http://haplotter.uchicago.edu/;
(24)) and in the form of the HGDP selection browser
(http://hgdp.uchicago.edu/; (57)). For the 1000 genomes 
project data, the online tool ENGINES (http://spsmart.cesga.es;
(58)) is useful for the analysis of allele frequencies
and a recent study presented a method to calculate cor- 
rected summary statistics from low coverage sequencing 
data (40). dbPSHP (http://jjwanglab.org/dbpshp) offers 
a large number of statistical tests in a SNP-specific 
manner for HapMap 3 and 1000 genomes datasets. 
Complementary to these databases, our database gives a 
large number of region- and SNP-specific scores (depend- 
ing on the test statistic) based on resequencing data (1000 
genomes Phase 1), with a special focus on genome-wide 
significance (by the ranked scores) and the visualization of 
several statistics in parallel (Figure 2). 

CONCLUSIONS 

By applying a large number of summary statistics to data 
from the 1000 genomes project, we have built a timely and 
expandable resource for the population genomics research 
community. An associated user-friendly genome browser 
gives a visual impression of the genetic variation in a 
genomic region of interest and offers functionality for an
array of downstream analyses. While this resource will
not replace a thorough, case by case analysis of selection, 
we expect that it will prove useful for the research
community through the large number of test statistics
and the fine-grained character of resequencing data. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 